{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Code based on the notebook created by Laura Nelson at https://github.com/lknelson/DH-Institute-2017/blob/master/07-Word2Vec/Word2Vec.ipynb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Prep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import the necessary libraries\n",
    "\n",
    "#Data Wrangling\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import string\n",
    "import os\n",
    "from nltk.tokenize import word_tokenize, sent_tokenize\n",
    "from random import choices\n",
    "\n",
    "import gensim #library needed for word2vec\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define some functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fast_tokenize(text):\n",
    "    \n",
    "    # Get a list of punctuation marks\n",
    "    punct = string.punctuation + '“' + '”' + '‘' + \"’\"\n",
    "    \n",
    "    lower_case = text.lower()\n",
    "    lower_case = lower_case.replace('—', ' ').replace('\\n', ' ')\n",
    "    \n",
    "    # Iterate through text removing punctuation characters\n",
    "    no_punct = \"\".join([char for char in lower_case if char not in punct])\n",
    "    \n",
    "    # Split text over whitespace into list of words\n",
    "    tokens = no_punct.split()\n",
    "    \n",
    "    return tokens"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import and Pre-Processing\n",
    "\n",
    "### Import Data\n",
    "\n",
    "For anonymization, the raw data is not provided in the Dataverse, but the following code is provided to show how the `semi_anonymized_comments.csv` file was created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('../data/comments_filtered.csv', encoding='utf-8', \n",
    "                 dtype={'score':int},\n",
    "                 low_memory=False)\n",
    "\n",
    "df = df.dropna(subset=['body'])\n",
    "df = df[df.body != '[deleted]']\n",
    "df = df[df.body != '[removed]']\n",
    "df = df[df.author != 'AutoModerator']\n",
    "df = df[df.author != 'TotesMessenger']\n",
    "# Remove all authors with bot in their name (probably both false positives and negatives)\n",
    "df = df[~df.author.str.contains('bot', case=False)]\n",
    "df = df[df.author != 'autotldr']\n",
    "df = df[~df.body.str.contains(\"I'm a bot\")]\n",
    "df['body'] = df.body.str.replace(r'&[gl]t', '')\n",
    "\n",
    "df.drop(['id', 'author'], axis=1, inplace=True)\n",
    "\n",
    "df.to_csv('../data/semi_anonymized_comments.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Start Here\n",
    "\n",
    "This code should work as is, but you may need to adjust the file paths. This imports the data and does some basic pre-processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('../data/semi_anonymized_comments.csv', encoding='utf-8', \n",
    "                 dtype={'score':int},\n",
    "                 low_memory=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>Month</th>\n",
       "      <th>Day</th>\n",
       "      <th>subreddit</th>\n",
       "      <th>body</th>\n",
       "      <th>score</th>\n",
       "      <th>post_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2020</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>Masks4all</td>\n",
       "      <td>Don’t buy Chinese masks</td>\n",
       "      <td>1</td>\n",
       "      <td>gbadkf</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2020</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>Masks4all</td>\n",
       "      <td>Some people like to bitch.\\n\\nI’m just glad he...</td>\n",
       "      <td>2</td>\n",
       "      <td>gb2hs4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2020</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>Masks4all</td>\n",
       "      <td>to be fair people are still mostly bitching at...</td>\n",
       "      <td>1</td>\n",
       "      <td>gb2hs4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2020</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>Masks4all</td>\n",
       "      <td>**invents time travel to avoid the no mask inc...</td>\n",
       "      <td>3</td>\n",
       "      <td>gb2hs4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2020</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>Masks4all</td>\n",
       "      <td>*Wearing a mask both times when you are suppos...</td>\n",
       "      <td>1</td>\n",
       "      <td>gb2hs4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>516681</th>\n",
       "      <td>2020</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>LockdownSkepticism</td>\n",
       "      <td>Good job. Just try to stay open to new evidenc...</td>\n",
       "      <td>2</td>\n",
       "      <td>jm9pmh</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>516682</th>\n",
       "      <td>2020</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>LockdownSkepticism</td>\n",
       "      <td>Now that I think about it SUB SAHARAN AFRICA d...</td>\n",
       "      <td>1</td>\n",
       "      <td>jmc56m</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>516683</th>\n",
       "      <td>2020</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>LockdownSkepticism</td>\n",
       "      <td>Trying to be on the right side of history is n...</td>\n",
       "      <td>1</td>\n",
       "      <td>jm9pmh</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>516684</th>\n",
       "      <td>2020</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>LockdownSkepticism</td>\n",
       "      <td>Totally following the science. Having a GOP le...</td>\n",
       "      <td>1</td>\n",
       "      <td>gf2ghm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>516685</th>\n",
       "      <td>2020</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>LockdownSkepticism</td>\n",
       "      <td>Tegnell has to pretend herd immunity doesn’t e...</td>\n",
       "      <td>7</td>\n",
       "      <td>jmefbv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>516686 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        Year  Month  Day           subreddit  \\\n",
       "0       2020      5    1           Masks4all   \n",
       "1       2020      5    1           Masks4all   \n",
       "2       2020      5    1           Masks4all   \n",
       "3       2020      5    1           Masks4all   \n",
       "4       2020      5    1           Masks4all   \n",
       "...      ...    ...  ...                 ...   \n",
       "516681  2020     11    2  LockdownSkepticism   \n",
       "516682  2020     11    2  LockdownSkepticism   \n",
       "516683  2020     11    2  LockdownSkepticism   \n",
       "516684  2020     11    2  LockdownSkepticism   \n",
       "516685  2020     11    2  LockdownSkepticism   \n",
       "\n",
       "                                                     body  score post_id  \n",
       "0                                 Don’t buy Chinese masks      1  gbadkf  \n",
       "1       Some people like to bitch.\\n\\nI’m just glad he...      2  gb2hs4  \n",
       "2       to be fair people are still mostly bitching at...      1  gb2hs4  \n",
       "3       **invents time travel to avoid the no mask inc...      3  gb2hs4  \n",
       "4       *Wearing a mask both times when you are suppos...      1  gb2hs4  \n",
       "...                                                   ...    ...     ...  \n",
       "516681  Good job. Just try to stay open to new evidenc...      2  jm9pmh  \n",
       "516682  Now that I think about it SUB SAHARAN AFRICA d...      1  jmc56m  \n",
       "516683  Trying to be on the right side of history is n...      1  jm9pmh  \n",
       "516684  Totally following the science. Having a GOP le...      1  gf2ghm  \n",
       "516685  Tegnell has to pretend herd immunity doesn’t e...      7  jmefbv  \n",
       "\n",
       "[516686 rows x 7 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the two dataframes, one for each subreddit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "m4a_comments = df.loc[(df.subreddit == 'Masks4all') & (df.score > 0) & (pd.notna(df.body)), ['body']]\n",
    "lds_comments = df.loc[(df.subreddit != 'Masks4all') & (df.score > 0) & (pd.notna(df.body)), ['body']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-Processing\n",
    "\n",
    "Break into sentences, then words. Remove punctuation, make lowercase, and remove stopwords."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def words_by_sentence(data):\n",
    "    words_by_sentence = []\n",
    "    for comment in data:\n",
    "        if comment == '':\n",
    "            continue\n",
    "        for sentence in sent_tokenize(comment):\n",
    "            words_by_sentence.append(fast_tokenize(sentence))\n",
    "    words_by_sentence = [sentence for sentence in words_by_sentence if sentence != []]\n",
    "    print(words_by_sentence[0])\n",
    "    return words_by_sentence"
   ]
  },
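  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the pre-processing described above, the cell below runs `words_by_sentence` on two made-up comments. This is only an illustrative sketch: the sample strings are hypothetical and are not drawn from the dataset, and the cell assumes that the cells defining `fast_tokenize` and `words_by_sentence` have been run and that NLTK's sentence tokenizer data is installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sample comments (not from the dataset); the empty string is skipped\n",
    "sample_comments = ['Masks work. Wear one!', '']\n",
    "\n",
    "# Each sentence becomes its own list of lowercase, punctuation-free tokens\n",
    "sample_sentences = words_by_sentence(sample_comments)\n",
    "print(sample_sentences)"
   ]
  },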
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['dont', 'buy', 'chinese', 'masks']\n"
     ]
    }
   ],
   "source": [
    "m4a_data = words_by_sentence(m4a_comments.body)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['i', 'dont', 'either']\n"
     ]
    }
   ],
   "source": [
    "lds_data = words_by_sentence(lds_comments.body)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Word2Vec\n",
    "\n",
    "### Word2Vec Features\n",
    "<ul>\n",
    "<li>Size: Number of dimensions for word embedding model</li>\n",
    "<li>Window: Number of context words to observe in each direction</li>\n",
    "<li>min_count: Minimum frequency for words included in model</li>\n",
    "<li>sg (Skip-Gram): '0' indicates CBOW model; '1' indicates Skip-Gram</li>\n",
    "<li>Alpha: Learning rate (initial); prevents model from over-correcting, enables finer tuning</li>\n",
    "<li>Iterations: Number of passes through dataset</li>\n",
    "<li>Batch Size: Number of words to sample from data during each pass</li>\n",
    "</ul>\n",
    "\n",
    "### Train Model\n",
    "\n",
    "This is the code that actually creates the word embeddings, and saves them as two separate model files. It may take a while to run (on the order of a few minutes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "m4a_model = gensim.models.Word2Vec(m4a_data, vector_size=100, window=5,\n",
    "                               min_count=10, sg=1, alpha=0.025, epochs=5, batch_words=10000, workers=4)\n",
    "\n",
    "lds_model = gensim.models.Word2Vec(lds_data, vector_size=100, window=5,\n",
    "                               min_count=10, sg=1, alpha=0.025, epochs=5, batch_words=10000, workers=4)\n",
    "\n",
    "# Save model for later use\n",
    "m4a_model.wv.save_word2vec_format('../data/word2vec_m4a_clean.txt')\n",
    "lds_model.wv.save_word2vec_format('../data/word2vec_lds_clean.txt')"
   ]
  }
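  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The saved files can later be reloaded without retraining. The cell below is a minimal sketch, assuming the files written by the training cell exist at the same paths. The query word `mask` is only an illustrative choice and must be in the model's vocabulary (that is, it must occur at least `min_count` times in the Masks4all comments)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import KeyedVectors\n",
    "\n",
    "# Reload the saved word vectors (word2vec text format, so binary=False)\n",
    "m4a_vectors = KeyedVectors.load_word2vec_format('../data/word2vec_m4a_clean.txt', binary=False)\n",
    "\n",
    "# 'mask' is an illustrative query word; guard against it being out of vocabulary\n",
    "if 'mask' in m4a_vectors.key_to_index:\n",
    "    print(m4a_vectors.most_similar('mask', topn=10))"
   ]
  }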
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "teaching",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
